
Nifi Atlas Bridge Development Opportunity #1

Closed
gjlawran opened this issue Jul 8, 2017 · 5 comments

gjlawran commented Jul 8, 2017

Background:

Is Java one of your languages of choice? Are you familiar with Apache NiFi (a data transformation and routing tool) and/or Apache Atlas? If so, check out this opportunity to work with the Ministry of Jobs, Training and Technology.
Tags: Java, Apache NiFi, Apache Atlas, Metadata, HortonWorks
Amount: $10,000.00 CAD

Description:

We are looking for help building an Apache NiFi-Atlas bridge that processes NiFi provenance data and logs it in Apache Atlas as lineage metadata, to support our planned use of the HortonWorks Data Governance framework.

More specifically, we want to be able to record in Atlas the "logical" lineage of a file from input all the way to being saved to HDFS or a database, even when it has been manipulated several times along the way. We would also like to be able to process the provenance data as the flow proceeds rather than waiting until it finishes, so that Atlas can show the current status of a file before the whole flow is complete.
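To illustrate the incremental-processing requirement, here is a minimal sketch of a reporting loop that consumes provenance events in batches and checkpoints the last event id, so each run only handles new events. All class and field names here are illustrative stand-ins, not real NiFi or Atlas APIs; the real NiFi provenance repository exposes events through a similar batched, id-ordered interface.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for a provenance event; not a real NiFi class.
class ProvenanceEvent {
    final long eventId;
    final String flowFileUuid;
    final String componentName;
    ProvenanceEvent(long eventId, String flowFileUuid, String componentName) {
        this.eventId = eventId;
        this.flowFileUuid = flowFileUuid;
        this.componentName = componentName;
    }
}

// Polls the provenance repository incrementally, remembering the last event
// id consumed so each run only processes new events. This is what would let
// Atlas show a file's current position before the whole flow has finished.
// Assumes events are delivered in ascending event-id order.
class IncrementalLineageReporter {
    private long lastEventId = -1;            // checkpoint across runs
    final List<String> reported = new ArrayList<>();

    void onTrigger(List<ProvenanceEvent> repository) {
        for (ProvenanceEvent e : repository) {
            if (e.eventId <= lastEventId) continue;   // already handled
            // In a real bridge this is where the Atlas entity/lineage
            // update would be emitted for the event.
            reported.add(e.componentName + ":" + e.flowFileUuid);
            lastEventId = e.eventId;                  // advance checkpoint
        }
    }
}
```

The key design point is the checkpoint: running the reporter twice over a growing repository reports each event exactly once, which is what makes per-event (rather than per-flow) Atlas updates safe.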

We are aware of the code at https://github.com/vakshorton/NifiAtlasBridge and https://github.com/vakshorton/NifiAtlasLineageReporter; however, it does not seem to do exactly what we want. The main problem with that code is that it generates Atlas metadata as input and output from too many flow ingress and egress events. Many processors change or clone data, which changes the FlowFile's id and in turn breaks the metadata lineage in Atlas.

A typical simplified use-case would be to get a ZIP file from the file system, move it to a different directory based on filename and date, do various actions on it (unzip as CSV, split, join, update attributes, run custom processors, manually edit it), save it to a different directory, infer an Avro schema, convert CSV to Avro, save the file, convert the file to ORC, put it in HDFS, generate Apache Hive DDL, and create a Hive table. The file names and directory names will not be hard-coded in the flow but will instead be parameter driven. In this scenario, we would like to be able to trace that Hive table's lineage all the way back to the ZIP file input.
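The uuid-change problem described above can be addressed by stitching derivations together: whenever a clone/fork/join event produces a new FlowFile uuid, record the parent-to-child link, then walk the chain backwards from the final output. The sketch below models this with illustrative class names (the real NiFi `ProvenanceEventRecord` exposes comparable information via its parent and child uuid lists).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of "logical" lineage stitching: when a processor
// clones or forks a FlowFile and its uuid changes, remember the link so
// the final output can still be traced back to the original ingress file.
class LineageStitcher {
    private final Map<String, String> childToParent = new HashMap<>();

    // Record that a CLONE/FORK/JOIN-style event derived childUuid
    // from parentUuid.
    void recordDerivation(String parentUuid, String childUuid) {
        childToParent.put(childUuid, parentUuid);
    }

    // Walk from a final FlowFile uuid back to the original ingress uuid,
    // so e.g. a Hive table output can be traced to the input ZIP file.
    List<String> traceBack(String uuid) {
        List<String> chain = new ArrayList<>();
        String current = uuid;
        while (current != null) {
            chain.add(current);
            current = childToParent.get(current);
        }
        return chain;
    }
}
```

With this in place, the broken-lineage symptom in Atlas becomes a lookup problem: each time the uuid changes mid-flow, the bridge records one derivation edge instead of emitting a fresh, disconnected input/output pair.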

Acceptance criteria:

A pull request from your GitHub account to this repo (https://github.com/bcgov/nifi-atlas) that works with the following software versions:
• Java 8
• Apache Atlas 0.8
• Apache NiFi 1.3+

• The solution must work in a clustered environment with provenance data coming from several nodes. We would like it to work in a site-to-site environment (cluster A does some processing, then hands it over to cluster B to do further processing - we would like to be able to have the lineage follow all the way through).

• It is acceptable for the code to work with a local file system, even though we will deploy it in the cloud (e.g. Azure or AWS).

• The code must allow us to add more processors and custom processors quite easily.

• The code needs to include a build script (Maven or Gradle) to compile the Java code and produce a NAR file.
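As a sketch of the build requirement, a NAR is normally produced by a Maven module that uses `nar` packaging with the `nifi-nar-maven-plugin` enabled as an extension. The coordinates below are placeholders, not the project's actual ones:

```xml
<!-- Illustrative pom.xml fragment; groupId/artifactId are placeholders. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>ca.bc.gov</groupId>
  <artifactId>nifi-atlas-bridge-nar</artifactId>
  <version>0.1.0-SNAPSHOT</version>
  <!-- "nar" is a custom packaging type provided by the plugin below -->
  <packaging>nar</packaging>
  <build>
    <plugins>
      <plugin>
        <!-- teaches Maven how to build the NAR packaging type -->
        <groupId>org.apache.nifi</groupId>
        <artifactId>nifi-nar-maven-plugin</artifactId>
        <version>1.2.0</version>
        <extensions>true</extensions>
      </plugin>
    </plugins>
  </build>
</project>
```

Running `mvn package` on such a module produces the deployable `.nar` file in `target/`.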

• The code needs to include some basic unit testing, as well as an example of use in a Dataflow template to demonstrate end-to-end functionality.

• This code does not need to be production ready, but must be a good starting point for us to use.

• The code needs to be commented where possible, especially in the areas of provenance processing.

How to apply:

To apply, please visit this opportunity on the BCDevExchange. Click the apply button and submit your proposal by 4:00 PM Pacific Standard Time (PST) on September 19, 2017.

With your proposal, you must attach a copy of the Code-with-Us Terms, with the required information asked for in the "Acceptance" section of the Terms inserted into the document.

If we are satisfied with the proposals we receive, we will assign this opportunity by September 26, 2017 with work proposed to start immediately.

Proposals will be evaluated based on the following criteria:

➢ Demonstrated knowledge of the NiFi provenance system and of best practices for processing provenance data efficiently, as expressed in pseudocode / structures (30 points),
➢ Experience contributing Java code to any public code repositories with more than 5 contributors (10 points),
➢ Experience contributing Java code to either of the following projects: https://github.com/apache/nifi, https://github.com/apache/incubator-atlas (10 points),
➢ Ability to deliver a complete solution on or before November 7, 2017 (10 points).


gjlawran commented Sep 6, 2017

Changed proposal submission instructions to use BCDevExchange App.


gjlawran commented Sep 7, 2017

Added requirement to include a copy of Code-with-Us Terms to the How to Apply instructions.


lmullane commented Sep 7, 2017

Hi everyone, we received a question regarding how this work will be awarded.

The answer is the Province's micro procurement process Code With Us. Read this for a quick overview and delve into the Wiki for more details on how Code With Us works.

Read and include a completed copy of the terms that you agree to when you submit a proposal to do the work.

Finally, proposals to be assigned the work will be assessed based on the evaluation criteria posted in the issue above.

If you have additional questions, post them here. We'll be happy to answer them.

Cheers,

Loren
The BCDevExchange team


jujaga commented Sep 15, 2017

We have learned from the NiFi dev mailing list (https://nifi.apache.org/mailing_lists.html) that a PR that includes https://github.com/ijokarumawak/nifi/tree/nifi-3709-2/nifi-nar-bundles/nifi-atlas-bundle might be helpful.

The following is a quick gap analysis of this code relative to our requirements:

  • We need support for the GetFile and PutFile processors - GetFile should be a "Non-Obscure Ingress Processor" and PutFile a "Non-Obscure Egress Processor".
  • GetFile and PutFile should be able to handle and track runtime parameters (e.g. dynamic filenames and paths).
  • We want GetFile and PutFile to use fs_path, or a specialized fs_path data set type, in Atlas so that we can see the lineage between ingress and egress.
  • We want per-processor lineage detail in Atlas - the number of represented nodes should equal the number of processors in the NiFi flow.
  • We wish to be able to show detailed lineage from GetFile -> Processor -> ... -> Processor -> PutFile.
  • We would like a working example that demonstrates the above (the current code did not produce nifi_data objects).

@gjlawran

This opportunity was successfully awarded and completed.
